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Abstract- In this paper parallel computation of manipulator forward dynamics 
is investigated. Considering three classes of algorithms for the solution of 

the problem, that is, the 0(n), the 0(n ), and the 0(n ) algorithms, 
parallelism in the problem is analyzed. It is shown that the problem belongs 

to the class of NC and that the time and processors bounds are of 0(log 2 n) and 

0(n ), respectively. However, the fastest stable parallel algorithms achieve 
the computation time of 0(n) and can be derived by parallelization of the 

0(n ) serial algorithms. Parallel computation of the 0(n ) algorithms requires 
the development of parallel algorithms for a set of fundamentally different 
problems, that is, the Newton-Euler formulation, the computation of the 
inertia matrix, decomposition of the symmetric, positive definite matrix, and 
the solution of triangular systems. Parallel algorithms for this set of 
problems are developed which can be efficiently implemented on a unique 
architecture, a triangular array of n(n+l)/2 processors with a simple 
nearest -neighbor interconnection. This architecture is particularly suitable 
for VLSI and WSI implementations. The developed parallel algorithm, compared 
to the best serial 0(n) algorithm, achieves an asymptotic speedup of more than 
two orders-of -magnitude in the computation the forward dynamics. 


I. INTRODUCTION 

The manipulator forward dynamics problem concerns the determination of 
the motion of the mechanical system resulting from the application of a set of 
joint forces/torques which is essential for dynamic simulation. The motivation 
for devising fast algorithms for forward dynamics computation stems from 
applications which require extensive off-line simulation as well as 
applications which require real-time dynamic simulation capability. In 
particular, for many anticipated space teleoperation applications, a faster- 
than-real-time simulation capability will be essential. In fact, in the 
presence of unavoidable delay in information transfer, such a capability would 
allow a human operator to preview a number of scenarios before run-time [1]. 

The forward dynamics problem can be stated as follows: Given the vectors 
of actual joint positions (Q) and velocities ( Q ) , the external force (f E ) and 

moment (n £ ) exerted on the End-Effector (EE), and the vector of applied joint 
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forces/torques (t), find the vector of Joint accelerations (Q). Integrating 
the vector of joint accelerations leads to the new values for Q and Q, and the 
process is then repeated for the next t. The first step in computing the 
forward dynamics is to derive a linear relation (for the given manipulator 
configuration described by the vector of joint positions) between the vector 
of Joint accelerations and the vector of applied inertial forces/torques. 

Given the dynamic equations of motion as 

A(Q)Q + C(Q, Q) + G(Q) + J^QIF = t (1) 

and defining the bias vector b as 

b = C(Q, Q) + G(Q) + J t (Q)F £ (2) 

the linear relation is derived: 

A(Q)Q = x-b = T (3) 

where Q, Q, and Q are nxl vectors and F , a 6x1 vector, is a combined 

representation of f and n E> A (Q) is an nxn symmetric, positive definite, 

inertia matrix, and J is the 6xn Jacobian matrix (t denotes matrix transpose). 
The bias vector b represents the contribution due to coriolis and centrifugal 
terms C(Q,Q), gravitional terms G(Q), and the external force and moment. 

Hence, in Eq. (3), T is the nxl vector of applied inertia forces/torques. The 
bias vector b can be obtained by solving the inverse dynamics problem, using 
the Newton-Euler (N-E) formulation [2], while setting the vector of joint 
accelerations to zero. The computation of the vectors b and T represent the 
common first step in any algorithm for solving the forward dynamics problem. 

The proposed algorithms for the forward dynamics problem differ in their 
approaches to solving Eq. (3), which directly affects their asymptotic 
computation complexity. These algorithms can be classified as: 

1) The 0(n) algorithms [3]-[6] which, by taking a more explicit advantage of 
the structure of problem, e.g. , by using the Articulated-Body Inertia 1 3 ] — 1 4 ] 
and recursive factorization and inversion of the inertia matrix [5] -[6], solve 
Eq. (3) in 0(n) steps without explicit computation and inversion of the 
inertia matrix. 

2) The 0(n ) conjugate gradient algorithms [7, 10] which iteratively solve 
Eq. (3) without explicit computation and inversion of the inertia matrix. The 
conjugate gradient algorithm is guaranteed to converge to the solution in at 
most n iterations which, given the 0(n) computational complexity of each 

2 

iteration, leads to an overall 0(n ) computational complexity. 

3) The 0(n 3 ) algorithms [7] which solve Eq. (3) by explicit computation and 
inversion of the inertia matrix, leading to an 0(n ) computational complexity. 

However, any analysis of the relative efficiency of these algorithms 
should be based on the realistic size of the problem, i.e. , the number of 
Degree-Of -Freedom (DOF), rather than the asymptotic complexity. In fact, the 

comparative study in [3] -[4] shows that the 0(n ) Composite Rigid-Body 
algorithm is the most efficient for n less than 12. It should be pointed out 
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that efficiency of the 0(n ) and 0(n ) algorithms has been recently improved 

3 

[9]— [10] . However, despite these improvements, even the fastest 0(n ) 
algorithm is far from providing the efficiency required for real-time or 
faster-than-real-time simulation. This observation clearly suggests that the 
exploitation of a high degree of parallelism in the computation is the key 
factor in achieving the required efficiency. 

The analysis of the efficiency of the different algorithms for parallel 
computation is more complex than that for serial computation. In the next 
section, the three classes of algorithms are analyzed based on their 

3 

efficiency for parallel computation and it is shown that the 0(n ) algorithms 
are also the most efficient for parallel computation. However, parallelization 

3 

of the 0(n ) algorithms represents a challenging problem since it requires the 
development of parallel algorithms for computation of a set of fundamentally 
different problems, i.e., the N-E formulation, the inertia matrix, the 
factorization of the inertia matrix, and the solution of triangular systems. 

Lee and Chang [15] were first to investigate the computation of the 

forward dynamics by parallelization of the 0(n 3 ) algorithms. Considering an 
SIMD architecture with n processors interconnected through a generalized-cube 
network, a modified version of their 0(log 2 n) algorithm in [16] and an 0(n) 

parallel version of the Composite Rigid-Body algorithm were developed for 

2 

computation of the N-E formulation and the inertia matrix. A parallel 0(n ) 
Cholesky algorithm and the 0(n) Column-Sweep algorithms were also proposed for 
the factorization of the inertia matrix and the solution of the resulting 

2 

triangular systems, leading to an 0(n ) complexity of the overall computation. 
However, the main drawbacks of the proposed algorithms reside in the 

complexity of the required interconnection network and the 0(n ) communication 
complexity which mainly results from the excessive data alignment needed for 
different algorithms. 

In this paper, we present a set of efficient parallel algorithms for the 

3 

computation of the forward dynamics, using the 0(n ) algorithms, which can be 
implemented on a two-dimensional array of n(n+l)/2 processors with a nearest 
neighbor interconnection. The overall of communication complexity, even with 
such simple interconnection structure, is limited to 0(n) and no additional 
data alignment between the computation of the different algorithms is 
required, which further reduces the overhead in the parallel computation. 

A new algorithm for computation of the inertia matrix is developed which, 
though not efficient for serial processing, achieves the best performance for 
parallel computing in terms of both computation and communication complexity 
while demanding simple architectural features for its implementation. The 
parallel algorithm for computing the inertia matrix achieves the time lower 
bound of 0( log 2 n)+0( 1 ) on the processor array. Synchronous data-flow parallel 

algorithms are also developed for factorization of the inertia matrix and the 
solution of the resulting triangular systems on the processor array. 
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This paper is organized as follows. In Section II, parallelism and time and 
processors bounds in the computation of the forward dynamics are investigated. 
In Section III, parallel algorithm for computation of the inertia matrix is 
developed. In Section IV, parallel computation of the bias vector and the 
linear system solution are briefly discussed. Finally, some concluding remarks 
are made in Section V. 

II. PARALLELISM IN FORWARD DYNAMICS COMPUTATION 

A. Time and Processor Bounds in the Computation 

The analysis of time and processors bounds in parallel computation of a 
given problem is of fundamental theoretical importance. It can determine the 
inherent parallelism in the problem and the bound on the number of processors 
required for exploiting maximum parallelism and achieving the time lower 
bound in the computation. However, besides the theoretical importance, it can 
also provide, as is the case for forward dynamics problems, useful insights 
into devising more practical and efficient parallel algorithms (in the sense 
of both computation time and number of processors) for the problem. 

Let P denote the class of problems that can be solved sequentially in a 
time bounded by a polynomial of the input size, n. Also, let NC (for "Nick’s 
Class" [18]) stand for the class of problems that can be solved in parallel 

in a time of 0(log 2 n), for some constant k, with a number of processors 

bounded by a polynomial of n. One open question regarding the complexity of 
parallel algorithms is whether P = NC, which is thought to be very unlikely 

[19] . It is clear that NC £ P. For k = 1, the time of 0( log 2 n)+0(l ) represents 

the natural time lower bound in the computation. However, most of the 
kinematic and dynamic problems in robotics belong to the class of NC [8]. 
Furthermore, it is possible to devise parallel algorithms which achieve the 
time lower bound of 0(log 2 n)+0(l ) In solving these problems [8,14,16,17]. In 

the following, we study the time and processors bounds in the computation of 
the forward dynamics by different algorithms. 

Using the N-E formulation, the bias vector can be computed in a time of 
0(log 2 n)+0(l) with 0(n) processors [15]— [16] . This implies that the time and 

processors bounds in the forward dynamics computation are determined by those 
in the solution of Eq. (3). Note that, with 0(n) processors, the integration 
of the computed joint accelerations can be performed in a time of 0(1). 

The solution of Eq. (3) by the 0(n) algorithms results in a set of first- 
order nonlinear recurrences which can be represented (at an abstarct level) as 

X = C + <p (X )/<f> (X ) = C + 0(X ) (4) 

1 1 *2 1 + 1 *i i+i i r 1+1 

where C is constant, <p and </> are polynomials of first and second degree, 

and deg <p = max (deg 0 , deg <f> =2. It is well-known that, regardless of the 

number of processors, the computation of nonlinear recurrences of the form of 
Eq. (4) and with deg <p>\ can be speeded up only by a constant factor 

[20] - [21 ] . This is due to the fact that the data dependency in nonlinear 
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recurrences and, particularly, those containing division, is stronger than in 
linear recurrences [22]. Hence, the parallelism in the 0(n) algorithms is 
bounded, that is, their parallelization leads to the 0(n) algorithms which are 
faster than the serial algorithm only by a constant factor. Note that a rather 
simple model was used for presentation of the nonlinear recurrences of the 
0(n) algorithms while they are far more complex than those usually studied 
in literature, e. g. , in [ 21 ] — [ 22 ] (see [8] for a more detailed discussion). 

For the conjugate gradient algorithms in [7], [10], the computation of 
each iteration, as is shown in [15], can be done in a time of 0(log 2 n) with n 

processors, leading to the 0(nlog z n) parallel algorithms. This implies that 

the parallelism in conjugate gradient algorithm is unbounded. Asymptotically, 
however, the parallel conjugate gradient algorithms are slower than the best 
serial algorithms, the 0(n) algorithms, for the solution of the problem. 

2 

The inertia matrix can be computed in 0(log n)+0(l) steps with 0(n ) 
processors [8], [11], [13], The implication of this result is that it further 
reduces the analysis of the time and processors bounds in the forward dynamics 
problem to that in a more generic problem, the linear system solution. Csanky 

2 4 

has shown that the linear system can be solved in 0(log 2 n) steps with 0(n ) 

processors [23] . This implies that the forward dynamics problem belongs to 
the class of NC. Note that, using Cramer’s rule, the linear system solution 
can be computed in O(log n) steps with 0(n! ) processors [20]. But such a 
result has neither theoretical nor practical importance. 

However, Csanky’ s algorithm is unpractical since, besides using too many 
processors, it is numerically unstable [25]. The best stable algorithms for 

linear system solution achieve a time of 0(n) with 0(n 2 ) processors [24] -[25], 

Hence, parallelization of the 0(n 3 ) algorithms results in the stable 0(n) 

2 

parallel algorithms with 0(n ) processors, which indicates an unbounded 
parallelism. 

The above analysis shows that the forward dynamics problem belongs to the 
class of NC and that the best known upper bounds on the time and processors 

are 0(log 2 n) and 0(n 4 ), respectively. Practically, however, the fastest (and 

stable) parallel algorithm for its computation is of 0(n). With respect to 
these results the main question is, given the fact that both the serial 0(n) 

and 0(n 3 ) algorithms result in the 0(n) parallel algorithms, which one is more 
efficient for parallelization? 

Let a^n+0^ denote the polynomial complexity of the serial 0(n) algorithms. 

There is a limited parallelism in both coarse grain and fine grain (in 
matrix-vector operation) forms in these algorithms [8]. Exploitation of this 
parallelism leads to the parallel algorithms with polynomial complexity as 
a z n+P 2 where, due to the limited parallelism, is reduced to a ^ only by a 

small factor. Furthermore, exploitation of both coarse and fine grain 

3 

parallelism requires additional architectural complexity. For the 0(n ) 
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algorithms, the polynomial complexity of the resulting parallel 0(n) algorithm 
is of the form a n+y [log n]+£ where a is smaller than a by more than two 

orders-of -magnitude. As a result, while the algorithm is asymptotically faster 
than the serial 0(n) algorithms and their parallel versions by a high constant 
factor, it is also more efficient for small n. The price to be paid for this 

efficiency, of course, is an architecture with 0(n 2 ) processors. However, the 
efficiency of the parallel algorithm and the suitability of the architecture 

for VLSI and WSI implementation strongly support the choice of 0(n 3 ) 
algorithms for parallel computing. 


III. PARALLEL COMPUTATION OF INERTIA MATRIX 

A. Basic Algorithms for computation of inertia matrix 

From Eq. (3) the elements of the inertia matrix can be computed as 


for the condition given by 

Q, = 1 and Q k3tl =0 For k =1, 2 n (6) 

Two physical interpretations can be thought for the above condition, with each 
interpretation leading to a distinct class of algorithms as 

1) The first i-1 links do not have any motion, that is, they are static, and 
the accelerations and the forces/torques of the last n-i+1 links result from 
the unit acceleration of link i. This interpretation leads to the first class 
of algorithms, designated as the class of Newton-Euler Based (NEB) algorithms, 
in which the diagonal and lower off-diagonal elements of the inertia matrix 
are computed. In (7] an algorithm of this class is presented, designated as 
the Original NEB (ONEB) algorithm, which computes the inertia matrix by 
successive applications of the N-E formulation. 

2) The last n-i+1 links can be considered as a single composite rigid 
system, since they do not have any relative motion, which is accelerating in 
space, leading to the exertion of forces and moments on the first i-1 static 
links. This interpretation leads to the second class of algorithms, designated 
as the class of Composite-Rigid Body (CRB) algorithms, in which the diagonal 
and upper off-diagonal elements of the inertia matrix are computed. In [7] an 
algorithm of this class, designated as the Original CRB (OCRB) algorithm, is 
presented in which the center of mass and the first and the second moment of 
mass with respect to the center of mass of a set of composite systems are 
computed. 

The comparative study in [7] shows that the OCRB algorithm achieves a 
significantly greater efficiency over the ONEB algorithm. In [8]-[9], we have 
developed an algorithm, designated as the Variant of CRB (VCRB) algorithm, 
which avoids the redundancies in the OCRB algorithm and represents the most 
efficient algorithm (to date) for computing the inertia matrix. Note that, 
however, due to the symmetry of the problem, both interpretations and hence 
both classes of algorithms should lead to the same results and computational 
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Fig. 1. Comparison of different algorithms for computation of inertia matrix 

efficiency. In [8] -[9], we have shown that, by introducing or reducing the 
redundancy in the computation, the algorithms of the two classes can be 
transformed to one another and, particularly, to the most efficient one, the 
VCRB algorithm. Figure 1 shows the relative serial efficiency of and 
redundancy in different algorithms. 

Although the results presented in Fig. 1 answer the question of the 
serial efficiency of different algorithms, it does not indicate which 
algorithm provides the most suitable features for parallelization. For serial 
processing, removing any redundancy increases the computational efficiency. 

For parallel processing, however, depending on its impact on the data 
dependency in the computation, this may increase or decrease the efficiency. 
The fact that arbitrary algorithms can be developed by introducing or removing 
different types of redundancy in the computation represents an additional 
degree-of -freedom that can be exploited to derive an algorithm which, though 
perhaps not efficient for serial computing, is the most suitable for 
parallelization. In 18], [12], we have shown that the NEB algorithms are more 
suitable for parallel computing than the CRB algorithms. In fact, they not 
only achieve a better computational complexity (in the parallel sense) but 
also require a less complex communication and synchronization mechanism. This 
better efficiency for parallelization mainly results from the fact that the 
evaluation of the columns of the inertia matrix by the NEB algorithms is order 
Independent and hence can be done in parallel. 

B. A Variant of Newton-Euler Based (VNEB) Algorithm 

Four different types of redundancy can be recognized in the ONEB algorithm, 
which can be eliminated, respectively, by [8], [9], [13]: 

1) Choosing a more suitable coordinate frame for projection of the equations. 

2) Optimizing the N-E formulation for the condition given by Eq. (6). 

3) Using a more efficient variant of the optimized N-E formulation. 

4) Introducing a two-dimensional recursion in the computation. 

Note that the first redundancy resides in the extrinsic equations and 
results from the choice of coordinate frame for projection of the intrinsic 
equations while the second, the third, and the fourth redundancies reside in 
the intrinsic equations and are Inherent in the formulation. As stated before, 
by removing all redundancies in the intrinsic equations, the ONEB algorithm 
can be transformed to the VCRB algorithm. However, removing any type of 
redundancy in the NEB algorithms, as far as the order independence property 
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is preserved, will also increase the efficiency of their parallel versions. 

In this regard, only the removal of the fourth redundancy, which leads to the 
introduction of a two-dimensional recursion in the computation, results in the 
loss of the order independence property of the algorithm. In the following, a 
Variant of the NEB algorithm, designated as the VNEB algorithm, is presented 
which is developed by removing the first three redundancies. 

The derivation of the Variant of N-E Based (VNEB) algorithm is fully 
discussed in [8] , [9] , [12] , [13] . Here, for the sake of completeness, a brief 
description of the algorithm is given. The algorithm is presented by the 
intrinsic equations. In this paper, according to the Gibbs notation, vectors 
are underlined once and tensors (tensors of order 2) twice. Also, in order to 
simplify and, particularly, unify the derivation of the serial and parallel 


algorithms, a set of notations, given in Table I and Fig. 3, are used. The 
VNEB algorithm is then written as 

For i * 1, 2, .... n 

For j = i, i+1 n 

y(J,i)=Z(l) (7) 

V(J,i) = V(j-l,i) + &(j,i)xP(J, J-l) = w(J, i)xP(j, i) (8) 

F(J+l,j,i) = M( j)V( j, 1 ) + w(J, l)xH(J) (9) 

N(j+1, j.i) - K(j)«(j,i) (10) 

For J = n, n-1 i 

F(n+1, j.i) = F(j+l,j,i) + F(n+1, J+l, i) (11) 

N(n+1, j, i ) = N( j+l, j, i) + N(n+1, J+l, i ) + P( J+l, J)xF(n+l, j+l, i) (12) 
with F(n+l,n+l, i) = N(n+l,n+l,i) = 0 

a jt = Z(J).N(n+l,J+l,i) (13) 


C. Parallel Algorithm for Computation of Inertia Matrix 

The serial computational complexity of evaluating the inertia matrix is 
2 

of 0(n ). No serial algorithm can achieve a better asymptotic complexity 

since, given n inputs (joint positions), the evaluation of the 0(n 2 ) outputs, 

the elements of the inertia matrix, requires 0(n 2 ) distinct steps in the 
computation. Based on the VCRB algorithm, we have already shown that the 

inertia matrix can be computed in 0( log 2 n)+0( 1 ) steps with 0(n 2 ) processors 

[8], [11]. It is interesting to note that not only the same bounds on time and 
processors can be much more easily derived by parallelization of the VNEB 
algorithm but also the resulting parallel algorithm, compared to the parallel 
VCRB algorithm, reduces the coefficients on the polynomial complexity. 

For the implementation of the parallel algorithm achieving the time lower 
bound, we consider a two-dimensional array of n(n+l)/2 processor-memory 
modules represented as PR , for 1 = 1, 2, .... n and j = i, i+1 n 
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(Fig. 3 shows the array for n = 6). For the parallel algorithm, the equations 
are projected onto the EE coordinate frame, coordinate frame n+1. An n DOF all 
revolute joints manipulator is considered for which the joint variables are 
the joint angles, that is, = 0^. It is assumed that the joint variables 0^ 

(or S0j and C0^) and constant parameters Sa^, Ca^, J+1 P(j+l,j), J+1 H(j), 

J+1 K ( j ) , and M(J) reside in the memory of all processors of Row j. The 
analysis of the mapping the algorithm onto the architecture of Fig. 3 is fully 
presented in [12]. Here, due to the lack of space, we only give the a brief 
description of the algorithm in terms of its computational steps and cost. 

For the parallel algorithm, the ith column of the inertia matrix is 
computed by the processors of the ith column of the processor array. The fact 

that w(j,i) = Z ( i ) for j = i, i+1 n, implies that global communication 

of Z(i) among the processors of the ith column is required. This requirement 
can be avoided by introducing two recurrences as 


w(j, i) = w(J-l, i) = Z(i) 

(14) 

P( j, i ) = P(J-l.l) + PC J, j-1 ) 

(15) 

Equation (14) does not need any computation while, for the parallel algorithm, 
the computation of Eq. (15) is required. By computing Eqs. (14)— (15) as a set 
of coupled recurrences, the terms w(j-l,i) can be considered as the data 
associated with Eq. (15). Using such a scheme increases the communication 
complexity of parallel evaluation of Eq. (15) but will result in the global 
distribution of Z(i) among the processors of the ith column. The computation 
of the parallel algorithm is then performed as follows. 

Step 1: 

1) Parallel compute R(j+l,j) by all processors of the jth row. 


For j = 1, 2 n 


For i = 1, 2 j 


PR jt : R( j+1, j) 

(16) 

2) Parallel compute R(n+1, j) by processors of the ith column. 


For i = 1, 2 n 

For j = i, i+1, .... n 


For 7} = 1 step 1 until [log^n+l-j )] , Do 

(17) 

R(J+2 1 \ j) = R(n+1, j) 

j+2 T) >j+2 1,_1 an+l 

R( j+Z 7 *, j) = R(n+l,j) = R(n+1, j+2 11 " 1 )R( j+2™, j) 

j+2 1 Wn>j+2 7r ' 1 

R(j+2 T) ,j) = R(j+2 1) , j+2 7) " 1 )R(J+2™, j) 

n+l>j+2 T, >j+2 Trl 


End Do 


3) Rotate R(n+1, j) by processors of Row j to the processors of Row j-1. 

For j = 1, 2, . . . , n 

For i = 1, 2 j 

PR : R(n+1, j+1 ) (18) 
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with R(n+l,n+l) = U (Unit Matrix) 

Note that, as the result of the above data transfer, both the terms R(n+1, j) 
and R(n+1, j+1) reside in the memory of all the processors of the jth row. 

4) Parallel compute n+1 Z(j), n+1 P(j+l,j), n+1 H(j), and n+1 K(j) by all the 


processors of 

the jth row. 


j = 

1, 

2 ... 

. * n 


For 

1 = 

= 1. 2 

» • • • » j 



a) 

PR : 
J1 

n+ 1 z(j) = R(n+1, j) J Z(j) 

(19) 



with 

4 * 
r— ♦ 

o 

o 

II 

s 

N 



b) 

PR : 
J» 

n+1 P(J + l.J) = R(n+1, j+1 ) J+1 P(j+l, J) 

(20) 



with 

J+1 PC J+1 . j ) = la i dSoc i 



c) 

PR : 
J» 

n+1 H(j) = R(n+1, j+l) J+1 H(j) 

(21) 


d) 

PR : 
Jl 

n+1 K(j) = R(n+1, j+1 ) J+1 K( j)R( j+1, n+1 ) 

(22) 


Note that for the processors of the nth row Eqs. (21 )— (22 ) do not need any 

computation since the terms n+1 H(n) and n+1 K(n) are given constant parameters. 
As the result of the computation of Step 1, all the vectors and the tensors 
are projected onto the coordinate frame n+1. In the following, the absence of 
superscripts denotes that the computations are performed in this frame. 


Step 2: 

1) Parallel compute P(j,i) and w(j,i) by processors of the ith column. 
For i = 1, 2, . . . , n 

For J = i, i+1 n 


For i) = 1 step 1 until flog^n+l-j)] , Do 
w(j+2 1) ,i) = wtj^.i) = Z(i) 

P( j+2 1 *, j) = P(i,j) 

P(j+2 1) ,j) = P(i, j) = P(i,j+2™) + P(J+2™, j) 
P(j+2 T, ,j) = P(j+2 1 ’, j+2 1 *" 1 ) + PU+Z^.j) 


(23) 

j+2 T, >j+2 11 " 1 2:i 
j+2 7, a:i>j+2 11 " 1 (24) 
i>j+2 1,, >j+2 T, ” 1 


End Do 


2) Parallel compute V(j,i), F(j+l,j,i), N(j+l,j,i) by processors of the ith 
column. 

For 1 = 1, 2 n 

For j = i, i+1, . . . n 

PRj,: V(j,i) = «(j,i)xP(j,i) (25) 
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(26) 


PR : F( j+1, j, i) = c>(J,i)xH(J) +M(j)V(j,i) 

PR jt : N( j+1 , J, i ) = K( j)w( j, 1) + H(j)xV(j,i) (27) 

Step 3: 

1) Parallel compute F(n+l,j,l) and N(n+l,j,l) by processors of the ith column. 
For 1 = 1, 2, . . . , n 
For J = 1, 1+1, .... n 

For v - 1 step 1 until flog^n+l-j)] , Do (28) 

F(J+2 1) , J. 1 ) = F(n+1, J,l) j+2^>j+2 1 * 1 £n+l 

F(J+2 t, ,J.l) = F(n+1, j, 1) = F(j+2 1 \j+2 1l) " 1 ,l)+F(j+2 7, “ 1 , j.l) j+2 T *>n+l> j+2 T) " 1 

F(j+2 1) ,J,1) - F(j+2 1) , j+2 T, " 1 ,l)+F(j+2 7) " 1 , j.l) n+l> j+2 T, > j+2 T> 1 

End_Do 

For 1} = 1 step 1 until [log^n+l-j)] , Do (29) 

N(J+2 T \j,l) = N(n+1, j, 1) j+2 7) >j+2 1} ^n+l 

N( J+2 T *. J, 1 ) = N(n+1, J, 1) = N(n+1, j+2 T, ' 1 ,l)+N(j+2 T) ' 1 , j,l) + 

P(J+2 1 *" 1 , j)xF(n+l, j+2 1) ' 1 ,i) J+2 T, sn+l>j+2 T, ‘ 1 

N( J+2 1 *, J, 1) = N(J+2 1) , J+2 n " l ,i)+N(j+2 T, “ 1 , j, 1) + 

P( J+2 1 *” 1 , j)xF( J+2 1 *, j+2 1 *" 1 , 1 ) n+l>j+2 7, >j+2 T,_1 


End_Do 

2) Parallel compute a jt by PR Jt - 

For 1 = 1, 2 n 

For J = 1 , 1+1 , . . . n 

PR jt : a « Z(J).N(n+l,J.i) ( 3 °) 

As stated before, the time lower bound In, as well as the computational 
cost of, parallel evaluation of the Inertia matrix, using the VNEB algorithm, 
Is determined by that of the parallel evaluation of the first column of the 
Inertia matrix. In other words, the computational cost of the n recurrences in 
Eqs. (17), (23), (24), (28), and (29) Is determined by that of the largest 
ones, i.e., for 1=1, which are of size n. Furthermore, the computational 

cost of all the 0(n 2 ) terms in Eqs. (16), ( 1 9 ) — ( 22 ) , (25)-(27), and (30) Is 

determined by the cost of one term since for each column n terms are computed 
in parallel and the computation for n columns, as will be discussed later, is 
overlapped. Let m and a denote the cost of multiplication and addition, 
respectively. The computational cost of the parallel algorithm is then 
evaluated as follows: 
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Step 1: The cost of Eq. (16) is 4m; The cost of Eq. (17) is (27m+18a) flog^l ; 

Eq. (18) represents a simple data rotation; Eq. (19) does not need any 
computation; The cost of Eqs. (20), (21), and (22) is (9m+6a), (9m+6a), and 
(54m+36a), respectively. The cost of this step is (27m+18a) [log 2 n] + (76m+48a) . 

Step 2: Eq. (23) does not need any computation; The cost of Eq. (24) is 
(3a) |"log 2 n] ; The cost of Eqs. (25), (26), and (26) is (6m+3a), (9m+6a), and 

(15m+12a), respectively. The cost of this step is (3a) [log 2 n] + (30m+21a) . 

Step 3: The cost of Eqs. (28) and (29) is (3a) [log^l and (6m+9a) flog^] , 

respectively; The cost of Eq. (30) is (3m+2a). The cost of this step is 
(6m+12a) |"log 2 n"| + (3m+2a). 

Adding the cost of Steps 1-3, the computation cost of the algorithm is 
obtained as (33m+33a) flog^^ + (109m+71a) . As stated before, mapping the 

developed parallel algorithm onto the processor array is fully presented in 
I 12] where it is shown that, even using a simple nearest neighbor 
interconnection structure, a communication complexity of 0(n) can be achieved. 
Also, the mechanisms for global and local synchronization for the processor 
array are presented. Figure 4 shows the resulting distribution of the 
elements of the inertia matrix among the processors. 


IV. PARALLEL COMPUTATION OF N-E FORMULATION AND SOLUTION OF LINEAR SYSTEM 

As stated before, the bias vector can be computed by evaluating the N-E 
formulation while setting the vector of joint accelerations to zero. We use 
the parallel algorithm presented in [15] for computing the bias vector. This 
computation is performed by the processors of the first column with the 
equations being projected onto the frame n+1. Therefore, the results of the 
computation of the first step except Eqs. (21)-(22) can be used while, similar 

to the terms n+l K(j) in Eq. (22), the terms n+1 J(j), for j = 1, 2 n, are 

also needed to be computed by the processors of the first column. The cost of 
evaluation of the vector T is then obtained as 15a|‘log 2 n'j + (141m+101a). With 

the nearest neighbor interconnection among the processors of the first column 
a communication complexity of 0(n) is achieved. In order to achieve the proper 
sequencing of the computation of the inertia matrix, the bias vector, and the 
linear system solution, a data-driven mechanism can be employed. That is, the 
processors of all columns, except the processors of the first column, by 
completion of the computation of their corresponding column of the inertia 
matrix enter the wait state while the processors of the first column start the 
computation of the bias vector and the vector T. The activity of the 
processors of column 2 through column n in linear system solution is triggered 
by receiving the corresponding data and by the completion of the computation 
of the vector T. 

Figure 4 shows the organization of the data resulting from the computation 
of the inertia matrix and the vector T. Given this data organization, we have 
developed synchronous data-flow parallel algorithms for the full solution of 
Eq. (3), that is, for decomposition of the inertia matrix, using the square- 
root-free variant of the Cholesky factorization, and the solution of the 
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resulting lower and upper triangular linear systems, which are presented in 
detail in [12]. The computation cost of the solution of Eq. (3) is obtained as 
(n-l)(3m+2a)+(lm+la), where the cost of division and multiplication is taken 
to be the same. A communication complexity of 0(n) is also achieved. 

Adding the cost of evaluation of the inertia matrix, the bias vector and 
the vector T, and the linear system solution, the computational cost of the 
forward dynamics problem is obtained as (3m+2a)n+(33m+48a) [log 2 h]+(238m+171a). 

The computational cost of the best serial 0(n) algorithm, the Articulated-Body 
algorithm [3]-[4], is evaluated as (380m+302a)n-(198m+173a) [15]. If the time 
of multiplication and addition is taken to be the same, then, for n = 6, the 
developed parallel algorithms achieve a speedup of 6 compared to the best 
serial 0(n) algorithm. Note, however, that as n increases, e.g. , for redundant 
manipulators, the speedup also significantly increases. To see this, let us 
write the computational cost of the serial 0(n) algorithm and the parallel 
algorithms as 682n+0(l) and 5n+0(log 2 n)+0(l), with the time of multiplication 

and addition taken to be the same. It can be seen that, asymptotically, the 
parallel algorithms achieve a speedup of more than two orders-of -magnitude. 

For small n, the computational cost of the parallel algorithms is dominated by 
the flog 2 n] -dependent and constant terms on the polynomial complexity. Hence, 

for small n, the computational complexity of the developed parallel algorithms 
can be practically considered as 0( log 2 n)+0( 1 ) . 


V. CONCLUSION 

We developed a set of parallel algorithms for computing the forward 
dynamics problem. These algorithms exploit a high degree of parallelism in 
the problem and achieve a significant speedup in the computation. Furthermore, 
they can be efficiently implemented on a two-dimensional array of processors 
with a nearest neighbor interconnection. This architecture is particularly 
suitable for practical implementation using VLSI and WSI technologies. Due to 
their simple architectural requirements, these algorithms, with some 
modifications, can also be efficiently implemented on rather more 
general-purpose architectures, e.g., a two-dimensional array of Transputers. 

A key factor in our approach to the parallel computation of the forward 
dynamics is the minimization of the resulting overhead. The overall 
communication complexity is of 0(n). The overhead is further minimized since 
there is no need for any data alignment between the computation of different 
algorithms and the intermediate data resulting from the different algorithms 
are generated and consumed within the array. Also, the final result of the 
computation, that is, the vector of joint accelerations, is computed by the 
processors of the first column. Therefore, they can be output using the same 
channels for inputting the data to the array (Fig. 5). This is particularly 
critical for VLSI and WSI implementation since, by using only n bidirectional 
Input/Output channels, the number of required pins is kept small. 
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Fig, 2. Link, Frames, and Kinematic and Dynamic Parameters. 


a d , & a Length, Distance, and Twist Angle of link i, respectively. 

q q & Q Position, Velocity, and Acceleration of joint i, respectively. 


M ( i ) 

J ( i ) 

H(i) 

K(i) 

Z(i) 

PC i , j) 

RU, j) 
w(i, j) 

VC i, j) 

F(k+l,i, J) 
N(k+1 , i, j) 


Mass of link i. 

Second moment of mass of link i about its center of mass (C^). 
First moment of mass of link i about point 0^. 

Second moment of mass of link i about point 0^. 

Axis of joint i 

Position vector from point 0^ to point 0^ 

A 3x3 matrix representing the orientation of coordinate frame j 
with respect to coordinate frame i. 

Angular acceleration of link i resulting from the unit 
acceleration of joint j. 

Linear acceleration of link i (point 0^ ) resulting from the unit 
acceleration of joint j. 

Force exerted on point 0^ due to the acceleration of links i 

through k, i.e., the links contained between points and 0^, 

resulting from the unit acceleration of joint j. 

Moment exerted on point 0^ due to the acceleration of link i 

through k, resulting from the unit acceleration of joint j. 


Table I. Notation used in derivation of serial and parallel algorithms 
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(a) 



Figure 3. A Two-Dimensional Processor Array (a) Data Input to 
Processor Array, (b) Distribution of Input Data Among Processors. 
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Figure 4. Organization of Data 

Resulting from Computation of the Figures. Data Output From 
Inertia Matrix and the Bias Vector. Processor Array. 
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